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Abstract. Symmetric tensor operations arise in a wide variety of computations. However, the 
benefits of exploiting symmetry in order to reduce storage and computation is in conflict with a desire 
to simplify memory access patterns. In this paper, we propose Blocked Compact Symmetric Storage 

£f-} wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. 

We propose an algorithm-by-blocks, already shown of benefit for matrix computations, that exploits 
this storage format. A detailed analysis shows that, relative to storing and computing with tensors 

^v^j without taking advantage of symmetry, storage requirements are reduced by a factor 0(m\) and 

computational requirements by a factor 0(m), where m is the order of the tensor. An implementation 
demonstrates that the complexity introduced by storing and computing with tensors by blocks is 
manageable and preliminary results demonstrate that computational time is indeed reduced. The 
paper concludes with a discussion of how the insights point to opportunities for generalizing recent 
I advances for the domain of linear algebra libraries to the field of multi-linear computation. 

CO 

1. Introduction. A tensor is a multi-dimensional or m-array. Tensor computa- 
tions are increasingly prevalent in a wide variety of applications [15] . Alas, libraries 
for dense multi-linear algebra (tensor computations) are in their infancy. The aim of 
this paper is to explore how ideas from matrix computations can be extended to the 

_C domain of tensors. 

Libraries for dense linear algebra (matrix computations) have long been part of 
the standard arsenal for computational science, including the Basic Linear Algebra 

i— — i Subprograms interface [TBI 113 IH1 EH IH] , L APACK [3] , and more recent libraries with 

similar functionality, like libflame [29l [25] . and the BLAS-like Interface Software 
framework (BLIS) [26]. For distributed memory architectures, the ScaLAPACK [8], 

^-j- PLAPACK [23], and (more recently) Elemental [20] libraries provide most of the 

■^j- functionality of the BLAS and LAPACK. High-performance implementations of these 

t" — libraries are available under open source licenses. 

For tensor computations, no high-performance general-purpose libraries exist. 
The MATLAB Tensor Toolbox gl |3J defines many of the kernels (i.e., the inter- 
face) that would be needed by a library for multilinear algebra, but does not have any 
high-performance kernels nor special computations or data structure for symmetric 
tensors. The Tensor Contraction Engine (TCE) project [6] focuses on sequences of 
tensor contractions and uses compiler techniques to reduce workspace and operation 
counts. The very recent Cyclops Tensor Framework (CTF) [53] focuses on exploit- 
ing symmetry in storage for distributed memory parallel computation with tensors, 
but does not investigate the sequential libraries that will invariably be needed on the 
individual nodes. 

In a talk at the Eighteenth Householder Symposium meeting in Tahoe City, CA 
(2011), Prof. Charlie Van Loan stated "In my opinion, blocking will eventually have 
the same impact in tensor computations as it does in matrix computations." This pa- 
per provides early evidence in support of this statement. The approach we take in this 
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paper heavily borrows from the FLAME project |25j . We use the change- of -basis op- 

eration, also known as a Symmetric Tensor Times Same Matrix (in all modes) (sttsm) 
operations [4 , to motivate the issues and solutions. We propose algorithms that re- 
quire minimal computation by computing and storing temporaries. Additionally, the 
tensors are stored by blocks, following similar solutions developed for matrices [T71I22] . 
The algorithms are reformulated to operate with these blocks. Since we need only 
store the unique blocks, symmetry is exploited at the block level (both for storage and 
computation) while preserving regularity when computing within blocks. The paper 
analyzes the computational and storage costs, demonstrating that the simplicity need 
not adversely impact the benefits derived from symmetry. An implementation shows 
that the insights can be made practical. The paper concludes by listing opportunities 
for future research. 

2. Preliminaries. We start with some basic notation, definitions, and the mo- 
tivating tensor operation. 

2.1. Notation. In this discussion, we assume all indices and modes are num- 
bered starting at zero. 

The order of a tensor is the number of ways or modes. In this paper, we deal only 
with tensors where every mode has the same dimension. Therefore, we define rI" 1 '™! 
to be the set of real-valued order-m tensors where each mode has dimension n, i.e., a 
tensor A £ r!" 1 '™] can be thought of as an m-dimensional cube with n entries in each 
direction. As the order of a tensor corresponds to the number of ways or modes of a 
tensors, we sometimes refer to an order-m tensor as an m-way tensor. 

An element of A is denoted as a io ... im _ 1 where i k e { 0, . . . , n — 1 } for all k G 
{ 0, . . . , m — 1 }. This also illustrates that, as a general rule, we use lower case Greek 
letters for scalars (a, x, ■ ■ bold lower case Roman letters for vectors (a, x, . . .), bold 
upper case Roman letters for matrices (A,X, . . .), and upper case scripted letters for 
tensors {A, DC, . . .). We denote the ith row of a matrix A by aj . If we transpose this 
row, we denote it as 2^. 

2.2. Symmetric tensors. A symmetric tensor is a tensor that is invariant under 
any permutation of indices. 

Definition 1 (Symmetric tensor). A tensor A E Rl m ' n l i s symmetric if its 
entries remain constant under any permutation of the indices, i.e., 

a *o-*'(m-i) = aj o--i (m _i) Vij G { 0, . . . , n - 1 } for j = 0, . . . , m - 1, and ir G IL„, 

where i! is result of applying the permutation ir to the index i, and H m is the set of 
all permutations of the set { 0, . . . , m — 1 }. 

For example, an order-2 symmetric tensor is nothing more than a symmetric matrix. 
Similarly, an example of a symmetric order-3 tensor is a tensor A G B^ 3 '™' with 
entries that satisfy ^ioi\i^ — ^^0^2*1 — ^^1*0*2 — ^^1*2*0 — ^^2*0*1 — ^^2*1^0 
*o> ii, «2 G { 0, . . . , n — 1 }. 

Since most of the entries in a symmetric tensor are redundant, we need only store 
a small fraction to fully describe and compute with a given tensor. Specifically, the 
number of unique entries of a symmetric tensor A G R[ m '™l is [5] 




Hence, we can reduce the storage by approximately ml by exploiting symmetry. 
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2.3. The sttsm operation. The operation used in this paper to illustrate issues 
related to storage of, and computation with, symmetric tensors is the change- of -hasis 
operation 



6 := [A;X,--- ,X] 



(2.1) 



where A. £ W m,n > is symmetric and X £ W xn is the change-of-basis matrix. This is 
equivalent to multiplying the tensor A. by the same matrix X in every mode. The 
resulting tensor 6 £ Rl m >p] i s defined elemcntwise as 



7j - 



E- E 

40=0 



-lPjoioPjxii ' ' ' Pi 



=0 



where jk £ {0, ... ,p — 1} for all k £ { 0, . . . ,m — 1 }. It can be observed that the 



resulting tensor C is itself symmetric. We refer to this operation (2.1) as the sttsm 
operation. 

3. The Matrix Case. We will build intuition about the problem and its solu- 
tions by first looking at order-2 symmetric tensors. 



3.1. The operation. Letting m — 2 yields C :— [A;X, X] where A £ 



[m,n] 



IS 



an n x n symmetric matr ix, C £ 



[m,p] 



XAX J . For to = 2, (2.1 ) becomes 



is a p x p symmetric matrix, and [A; X, X] 



n-l 

= x Jo Ax jl = ) 

i =0 



Xjoio (-^- x ji)io 



n-l 

E 

io=0 



n-l 

E 

ii=0 



EE 

a ioiiXjoioXjiii ■ 

(3.1) 

20=0 41=0 



3.2. Simple algorithms. Based on (3.1 ), a naive algorithm that only com putes 
the upper triangular part of symmetric matrix C = XAX T is given in Figure ', 



3.1 



Jleft), 

at a cost of approximately 3p 2 n 2 floating point operations (flops). The algorithm to its 
right reduces flops by storing intermediate results and taking advantage of symmetry. 
It is motivated by observing that 



XAX' 



X AX J 
T 

( x 



\ x p-l 



A( x 



Xp—1 




(3.2) 



where tj = Axj £ M. n and Xj £ M. n (recall that Xj denotes the transpose of the jth 
row of X). This algorithm requires approximately 2pn 2 + p 2 n flops at the expense of 
requiring temporary space for a vector t. 
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Naive algorithms 


Algorithms that reduce computation 
at the expense of temporary space 


A is a matrix (to = 2): C := XAX r = [A; X, X] 


for ji = 0, .. .,p- 1 
for j = 0, . . . , ji 

7icm : = 

for i = 0, . . . , n — 1 
for ii = 0, . . . , n — 1 

Tjoji : ~ Ijaji + a iaiiXjoioXjiii 

endfor 
endfor 
endfor 
endfor 


for ji = 0, . . . ,p- 1 
t := tj x = A.Xj 1 
for jo = 0, . . . , ji 

Ijoh '■— x j T 
endfor 

endfor 



A is a 3-way tensor (to = 3): C := [A; X, X, X] 


for j 2 = 0, .. .,p - 1 
for .71=0,... , j 2 
for j = 0, ... , ji 

for i 2 = 0, . . . , n — 1 
for i\ = 0, . . . , n — 1 
for io = 0, . . . , n — 1 

7joJlj2+ - = 

Q: io*l i 2Xio < oXil i lXi2i2 

endfor 

endfor 


for j 2 = 0,. .. ,p- 1 

:=T$ =A x 2 xl 
for ji = 0, . . . , j 2 

t 1] —t (1) - T (2) Xi x T 
for j = 0, . . . , ji 

c(i) ~T 

endfor 
endfor 
endfor 



A is an m-way tensor: C := [A; X, • • • , X] 


for j TO _i =0,...,p-l 

for j = 0, . . . , ji 

for i m _i = 0, . . . ,n - 1 

for io = 0, . . . , n — 1 

7jo—jm-i"r := 

a io---i2Xjaia ' ' ' Xj m -ii m -i 

endfor 

endfor 


for j m _i = 0, .. .,p - 1 

^ * ^ 3m, — 1 *^ ^fTl— 1 ^j m -i 

for ji = 0,. .. , j 2 

t := tj 1 ...j m l — - 1 Xi 
for j = 0, ... , ji 

?(1) ~T 

endfor 
endfor 

endfor 



Fig. 3.1. Algorithms for C := [.A; X, • • • , X] that compute with scalars. In order to facilitate the 
comparing and contrasting of algorithms, we present algorithms for the special cases where m = 2, 3 
(top and middle) as well as the general case (bottom). For each, we give the naive algorithm on the 
left and the algorithm that reduces computation at the expense of temporary storage on the right. 
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Relative storage of BCSS 
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relative to minimal storage 


n(n + l)/2 
n(n + 6 A )/2 


0.67 


0.80 


0.89 


0.94 


relative to dense storage 


n 2 


1.33 


1.60 


1.78 


1.88 


n(n + b A )/2 



Fig. 3.2. Storage savings factor of BCSS when n = 512. 



3.3. Blocked Compact Symmetric Storage (BCSS). Since matrices C and 
A are symmetric, it saves space to store only the upper (or lower) triangular part 
of those matrices. We will consider storing the upper triangular part. While for 
matrices the savings is modest (and rarely exploited), the savings is more dramatic 
as the tensor order increases. 

To store a symmetric matrix, consider packing the elements of the upper triangle 
tightly into memory with the following ordering of unique elements: 



fO 1 3 

2 4 
5 

V 



'■J 



Variants of this theme have been proposed over the course of the last few decades 
but have never caught on due to the complexity that is introduced when indexing the 
elements of the matrix |13l [5] . Given that this complexity will only increase with the 
tensor order, we do not pursue this idea. 

Instead, we embrace an idea, storage by blocks, that was introduced into our 
libf lame library [T71 US] in order to support algorithms by blocks. Submatri- 
ces (blocks) become units of data and operations with those blocks become units of 
computation. Partition the symmetric matrix A € M. nxn into submatrices as 





/ A 00 


Aqi 


A 02 


Ao(R-l) 


\ 




Am 


A n 


A12 


Ai(fj_i) 




A = 


A 20 


A 21 


A 22 


A 2 (ri_l) 






\A(fi_i)o 


A(fj_i)i 


A(fj_l)2 • 


• A( R _x)( R _ 


J 



(3.3) 



Here each submatrix At, 



pb A xh A 



We define n = n/b A where, without loss 
of generality, we assume 6a evenly divides n. Hence A is a blocked n x n matrix 
with blocks of size 6a x 6a- The blocks are stored using some conventional method 
(e.g., each Aj o5l is stored in column-major order). For symmetric matrices, the blocks 
below the diagonal are redundant and need not be stored (indicated by gray coloring). 
Although the diagonal blocks are themselves symmetric, we will not take advantage 
of this in order to simplify the access pattern for the computation with those blocks. 
We refer to this storage technique as Blocked Compact Symmetric Storage (BCSS) 
throughout the rest of this paper. 
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Storing the upper triangular individual elements of the symmetric matrix A re- 
quires storage of 



n(n + l)/2 = ^ J floats. 

In constrast, storing the upper triangular blocks of the symmetric matrix A with 
BCSS requires 

n(n + b A )/2 = b 2 A f ™ * ^ floats. 

The BCSS scheme requires a small amount of additional storage, depending on 
&a ■ Figure |3.2| illustrates how the storage needed when BCSS is used becomes no 
more expensive than storing only the upper triangular elements (here n = 512) as the 
number of blocks increases. 

3.4. Algorithm-by-blocks. Given that C and A are stored with BCSS, we 
now need to discuss how the algorithm computes with these blocks. Partition A as 



in (3.3| and C and X like 



f Coo • • • C (p-i) \ / X o • • • X ( S _i) 

C = : • . . : and X = : • . , : 

\C(p_l)0 ••• C^.!)^.!)/ \X ( p_ 1)0 ••• X(p_ 1 )( fl _ 1 ) / 

where, without loss of generality, p — p/bc, and the blocks of C and X are of size 
i>c x frc an d be x b A , respectively. Then C := XAX T means that 



/ Aqo • • • A ( fl _i) \ / Xj i0 



Jo(n-l) 



V A (n-1)0 '•• A( fi _i)( fi _i) / \ Xj l(fi 
n— 1 n— 1 

\ \ * sr A vT \ \ 



Ji(n-l) 



n— 1 n— 1 n— 1 n — 1 

XjoToAwiXjA = XI ^[AwriXj^^X^J (in tensor notation). 

*o— Ozi— zo—0 2i=0 

(3.4) 



This yields the algorithm in Figure |3.3[ in which an analysis of its cost is also given. 
This algorithm avoids redundant computation, except within symmetric blocks of C, 



Cj j . Comparing (3.4 1 to (3.1 1 we sec that the only difference lies in replacing scalar 
terms with their block counterparts. Consequently, comparing this algorithm with 
the one in Figure [3~T| (top-right), we notice that every scalar has simply been replaced 
by a block (submatrix). The algorithm now requires n x be extra storage (for T), in 
addition to the storage for C and A. 

4. The 3-way Case. We now extend the insight gained in the last section to the 
case where C and A. are symmetric order-3 tensors, before moving on to the general 
case in the next section. 

4.1. The operation for order-3 tensors. Now C := [7l;X,X, X] where A. £ 
R [m,r»] j g g M [m, P ] ; an d [A; X, X, X] =Ax Xx 1 Xx 2 X. In our discussion, A is a 



Algorithm 



Ops 
(flops) 



Total # of 
times executed 



Temp, 
storage 



for ji = 0, . . . , p — 1 



( \ 


i 


T 

\ T A-i J 


\ 



\ A («-l)0 



A 0(fi-1) 
v («-l)(ft-l) I 



2b a n 2 



b c n 



\ ^Jl(B-l) / 



for jo = 0, ... ,.7i 

C 3031 := ( X X>0 '•■ X Jo("-l) ) 

endfor 
endfor 



2b 2 c n 



p(p + l)/2 



Total Cost: 2b c n 2 p + 2b 2 c n [p(p + l)/2) = ^ ^26^ +1 n 2 - d Q * ^ 2pn 2 + p 2 ™ flops 



Total temporary storage: bc ra = (^j^n 1- ^ cn trics 



Fig. 3.3. Algorithm-by-blocks for computing C := XAX T = [A;X,X]. The algorithm assumes 
that C is partitioned into blocks of size be x ^c, with p = [p/&c~l- 



Algorithm 



for j 2 = 0, . . . ,p - 1 
r (2) 
' 00 



/ J 



r(2) 

0(n-l) 



V J (n-l)0 



7?) 



(«-l)(n-l) . 

yi x 2 ( x j2 o • • • x j2 ( fi _ 1 ) ) 

for Ji = 0, . . . , J 2 

r (2) _ _ _ T (2)^ 



0(n-l) 



r(2) 



... T 



(2) 



for jo = 0, . . . ,ji 

S30J1J2 := 



(fi-l)O " (n-l)(n-l) / 

Xl ( X 3l0 ' ' ' X j!(fi-1) ) 



<0 ( X J ' ' ' X j (n-1) ) 



endfor 
endfor 
endfor 



\ J n-1 



Ops 
(flops) 



Total # of 
times executed 



2b e n A 



2„2 



2bi,n 



2b%n 



pip +l)/2 

cr) 



p(p+l)(p + 2) 

rr 2 ) 



Total Cost: £ (^V^ f + d ) ) « 2pn 3 +p 2 n 2 + ^ flops 

d=0 ^ d + 1 ' 3 

1 

Total temporary storage: bf>n 2 + b\n = (^e +lra2_d ) cn t r i cs 



Fig. 3.4. Algorithm-by-blocks for computing [.A; X, X,X], The algorithm assumes that C is 
partitioned into blocks of size bp X bp X bp, with p = [p/bel ■ 
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symmetric tensor as is C by virtue of the operation applied to A.. Now, 



n-1 

_ a ^ T ^ T — \^ / a ^ T \ 

Ijahte ~ JxXq Xj Xi Xj l X2 Xj 2 — 2_^ / \ J ^ X l X ji X 2 x j 2 )io X Xjoio 

io=0 

n — 1 n— 1 

= E(E( A X2 X lXji'*i)*o XoXjoio 

io— ii— 
n — 1 n—1 n — 1 



12—0 zi— io— 



4.2. Simple algorithms. A naive algorithm is given in Figure 3.1 (middle- left). 
The cheaper algorithm to its right is motivated by 



lx XxiXx 2 X = ilxo 



\ x p-l 



t 



\ x p-l 



X 2 



X() 



frpP) T (2) 
\^ - l " ' 
s 

j(2) \*p-l. 

/ t (1) ••• t (1) \ 
L 00 0(p-l) 



X() 



Vt (1) •••t (1) / 



r(i) 



Xl 



L p-1 



where T 
t (1) G 

U»2 



(2) 



Ax,9 T t (1) 



T| 2 2) Xi xl = A x 2 xf 2 Xl $1, and t£> g R nxn , 



Although it appears T^ 2 -* is a vector (oriented across the page) of matrices, it is 
actually a vector (oriented into the page) of matrices (each T\ 2 should be viewed 
a matrix oriented first down and then across the page). Similarly, CT*- 1 -* should be 
viewed as a matrix (first oriented into the page, then across the page) of vectors 
(each t^ 1 ] should be viewed as a vector oriented down the page). This is a result of 
the mode-multiplication and is an aspect that is difficult to represent on paper. 

This algorithm requires p(2n 3 +p(2n 2 + 2pn)) = 2pn 3 + 2p 2 n 2 + 2p 3 n flops at the 
expense of requiring workspace for a matrix T of size p x n and vector t of length n. 

4.3. Blocked Compact Symmetric Storage (BCSS). In the matrix case 
(Section^, we described BCSS, which stores only the blocks in the upper triangular 
part of the matrix. The storage scheme used in the 3-way case is analogous to the 
matrix case; the difference is that instead of storing blocks belonging to a 2-way upper 
triangle, we must store the blocks in the "upper triangular" region of a 3-way tensor. 
This region is comprised of all indices (107*17*2) where io < i\ < ii. For lack of a 
better term, we refer to this as upper hypertriangle of the tensor. 

Similar to how we extended the notion of the upper triangular region of a 3- 
way tensor, we must extend the notion of a block to three dimensions. Instead of a 
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TO 


= 3 
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T7 3 


TO 


= d 


rn 




77 rf 



Fig. 4.1. Storage requirements for a tensor A. under different storage schemes. 



block being a two-dimensional submatrix, a block for 3-way tensors becomes a 3-way 



subtensor. Partition tensor A € 



into cubical blocks of size bji x bji x bji : 



( 



A., = 



-4-000 

Aiqq 



/ 



-1)00 



■Ao(ri- 
"l(n- 



1)0 
1)0 



A. 



(n-l) 



■A(n-l)10 ' " ' ^l(n-l)(n-l)0 / 



00(n-l) 



1) 



•A-10(fi-l 



-^-Ol(ft-l) 
■^ll(n-l) 



0(n-l)(n-l) 



•™l(n- 



(n-l)(n-l) 



\^l(n-l)0(fi-l) ^-(B-l)l(n-l) ' ' ' •A(H-l)(ft-l)(fi-l) / 



where A iii 2 ^ s ^ x ^ x ^ (except, possibly, the fringe blocks) and n = \n/bjC\. 
These blocks are stored using some conventional method and the blocks lying out- 
side the upper hypertriangular region are not stored. Once again, we will not take 
advantage of any symmetry within blocks (blocks with lo = ii, Iq = or T\ =12) to 
simplify the access pattern when computing with these blocks. 

Summarized in Figure |4.1[ we see that while storing only the upper hypertri- 

fn + 2\ 

angular elements of the tensor A requires I storage, one has to now store 



b\ ( ^ ) oli'iHODts. However, since 



77 3 , , 77 3 

3! 6 - = - 



, we achieve a sav- 



7l + 2^ 

3 J UA 3! 6 
ings of approximately a factor 6 (if 77 is large enough) relative to storing all elements. 
Once again, we can apply the same storage method to G for additional savings. What 
we have described is the obvious extension to BCSS, which is what we will call it for 
higher order tensors as well. 

Blocks for which the index criteria To — i\ — 12 is NOT met yet still contain 
some symmetry are referred to as partial-symmetric blocks. Taking advantage of 
partial-symmetry in blocks is left for future research. 

4.4. Algorithm-by-blocks. We now discuss an algorithm-by-blocks for the 3- 
way case. Partition C and A into b% and b\ blocks, respectively, and partition X 
into be x bji blocks. Then, extending the insights we gained from the matrix case, 
6 := [A; X, X, X] means that 



n— 1 n — 1 n—1 



e 7 



y^ y^ ^Uoma x o Xj Ot0 x x x J:m x 2 x^ 



lO — l±— 22—0 



n—1 n — 1 n—1 

y 1 y ^ y 1 [AiQt^ ; Xj j; Q , Xj^ , Xj 

1 — 22—0 
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This yields the algorithm in Figure |3.4[ in which an analysis of its cost is also given. 
This algorithm avoids redundant computation, except for within blocks of C which 
are symmetric or partially-symmetric (Jo = 3i or ji — 32 or Jo — 32)- The algorithm 
now requires ben 2 + b e n extra storage (for and 7^, respectively), in addition 
to the storage for C and A. 

5. The m-way Case. We now generalize to tensors C and A of any order. 

5.1. The operation for order-m tensors. Letting m be any non-negative 
integer value yields 6 := [A;X,X,--- ,X] where A € R [m <" ] , C € R [m < p \ and 
[A; X, X, •••X] = yixoXxiXx2-- - X m -i X. In our discussion, A is a symmetric 
tensor as is C by virtue of the operation applied to A. 

Let 7j j 1 ...j m _ 1 denote the (jo, ji, . . . , j m -i) element of the order-m tensor C 
Then, by simple extension, we find that 

Tj'oj'i— jm-i = A X X J X l x j 1 X 2 •• ■ x m-l X J rn _ 1 
n—1 n—1 



E 



,_ 1= io=0 



a ioii—im-iXjoioXjiii ' ' ' Xjm-lin 



5.2. Simple algorithms. A naive algorithm with a cost of (m + l)p m n m flops 



is now given in Figure 3.1 (bottom- left). By comparing the loop structure of the naive 
algorithms in the 2-way and 3-way cases, the pattern for a cheaper algorithm (one 
which reduces flops) in the m-way case should become obvious. Extending the cheaper 



algorithm in the 3-way case suggests the algorithm given in Figure 3.1 (bottom-right), 
This algorithm requires 



m—l 



p(2n 3 +p(2n 2 + 2pn)) = 2pn m + 2p 2 n m ~ 1 + ■■■ 2p m - 1 n = 2 V p l+1 n m - 1 flops 



i=0 



at the expense of requiring workspace for tensors of order 1 through to — 1. 

5.3. Blocked Compact Symmetric Storage (BCSS). We now further ex- 
tend BCSS for the general TO-way case. The upper hypertriangular region now con- 
tains all indices (io, ii, . . . , i m -\) where i$ < i\ < . . . < i m -\- Using the 3-way case 
as a guide, one can envision by extrapolation how a block partitioned order-m tensor 
looks. The tensor A € Iftl" 1 '™! is partitioned into hyper-cubical blocks of size bj\. The 
blocks lying outside the upper hypertriangular region are not stored. Once again, we 
will not take advantage of symmetry within blocks. 



Summarized in Figure 4.1 while storing only the upper hypertriangular elements 

(ii -|- fyi — \\ ( ti ~\~ in — 1 \ 

) storage, one has to now store ) bj\ 

my \ m J 

elements which potentially achieves a savings factor of m! (if n is large enough). 

'n + m — 1\ n 



Although the approximation ( }bji ~ — r ^ s use d> the lower-order 

\ m J ml 

terms have a significant effect on the actual storage savings factor. Examining Fig- 
ure [5Tj we see that as we increase m, we require a significantly larger n to have the 
actual storage savings factor approach the theoretical factor. While this figure only 
shows the results for a particular value of bj±, the effect applies to all values of bji. 

5.4. Algorithm-by-blocks. Given that C and A are going to be stored us- 
ing BCSS, we now need to discuss how to compute with these blocks. Assume the 
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Fig. 5.1. Actual storage savings of BCSS on A. 



Algorithm 

for j m _i = 0, . . . ,p — 1 

T (m-l) : =Ax m _i (X Jm _ l0 



X J m -i(«-i) ) 



for ji = 0, . . . , j 2 

T(D := T(2) Xl (X Jl0 ■■■ X,, («_!)) 
for Jo = 0, . . . , Ji 

T (1) x ( X Jo o ■ ■ ■ X Jo(fi _ 1) ) = 

XO ( X J ° ' ' ' X Jo<""!) ) 



endfor 
endfor 



endfor 



Ops 
(flops) 



2b e n r ' 



2b™~ L n 2 



2bTn 



Total # of 
times executed 



p + m — 1 
m 



Temp. 

storage 

ben" 1 " 1 



Total Cost: ]P (26; 



h d+l„m-d{P + d 
d+1 



flops 



Total additional storage: ^ ff>g +1 n m ~ 1 ~ ti ') floats 



Fig. 5.2. Algorithm-by-blocks for computing C := [A;X, ••• , X]. T/ie algorithm assumes that 
6 is partitioned into blocks of size W£ , with p = [p/fcel • 



partioning discussed above, with an obvious indexing into the blocks. Now, 

n— 1 n — 1 

^JoJl-'-jm-l = y , ' ' ' ^ ] -^ljo«l X -^-Jo«o X l -^-JlU x 2 ' ' ' x m-l Xj m -l?m-i 



zo— z m _i— 
ft-1 ft-1 

20—0 i m -i- 
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Fig. 5.3. Memory requirements for various objects used. 



This yields the algorithm given in Figure |5.2[ which avoids much redundant com- 
putation, except for within blocks of C which are symmetric or partially-symmetric 
(Ji = 3k and i ^ k). 



5.5. Memory requirements. The algorithm described in Figure [572] utilizes a 
set of temporaries. At its worst, the sttsm operation acting on a tensor A G M\ m,n ^ 
forms a temporary 'j( m ~ 1 ) with be entries [^] for a chosen block dimension be- 

The algorithm requires additional temporaries CT^ m_2 ',-- - ,7^; however, these are 
significantly smaller than cr*-" 1-1 ' and therefore we ignore their space requirements in 
this analysis. We store A with BCSS with block size bji and we store all elements 
of T (m_1) ,--- ,T (1) . There may be partial symmetries in T^ m_1 \ • • • but we 

do not tak e advantage of these here. The storage requirements are then given in 



5.3 



fl _|_ jyi — J\ 7i m 

where we use the approximation [ ] w — - when m n. 

m J ml 



Figure 

It is interesting to ask the question how parameters have to change with m, 
as m increases, in order to maintain a constant fraction of space required for the 
temporaries. For example, if we wish to keep memory required for CT^ m_1 ^ constant 
relative to the memory required for A, then the fraction 

n m - x be n m ~ x be b e m\ 



(n+m-l\ i m 
\ m )°A 



must be kept approximately constant. Hence, if n is constant and m increases, be 
must be chosen inversely proportional to ml. This is disconcerting, because the larger 
be, the less benefit one derives from BCSS. On the other hand, if we look at this more 
optimistically, and allow for a fraction of the dense storage for the workspace, then 



n m— 1 



b e be 



and be can be kept constant as m increases. 

Because we are storing symmetric tensors hierarchically, in practice, we require 
additional storage for meta-data (implementation-specific data that is unrelated to 
underlying mathematics) associated with each block of the tensors we store with 
BCSS. Although this does impact the overall storage requirement, as this is an im- 
plementation detail we do not thoroughly discuss it here. The impact of meta-data 
storage on the proposed algorithm is discussed in Appendix [B] and is shown to be 
minimal. 



5.6. Computational cost. In Figure 5.2 an analysis of the computational cost 



of computing intermediate results is given, as well as the overall cost. Notice that 



is an m-way tensor where each mode is of dimension n except the last which is of 

dimension fee 
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if the algorithm did not take advantage of symmetry, the range for each nested loop 
becomes 0, . . . ,p— 1, meaning that each nested loop is executed p times, and the total 
cost becomes 



m—1 m—1 



J2 {2b d e +1 n m - d p d+1 ) = Yl (2« m "V +1 ) flops. 

d=Q d=0 

A simple formula for the speedup achieved by taking advantage of symmetry is non- 
trivial to establish in general from the formula in Figure |5.2| and the above formula. 
However, when rfCpwe find that 

v^m-l 9„m-d„rf+l v-im-1 m -d n d+l m -d n d+l 

l^d=0 z,i P ^ l^d=0 11 P _ l^d=0 11 P 



V" 71-1 Oh d+1 n m - d ( p+d \ y™- 1 Ud+l^m-d P d+1 V"- 1 „m-d P d+1 

2^d=o zo e 71 \d+i) 2^d=o °e n id+iv. 2^d=< 



(and, if n = p,) 



(d+iy. ^d=o " (d+iy. 

m-l „m+l 



Em— 
d=0 



III 



Em— 1 n m + 1 sr^m — 1 1 

d=0 (d+l)\ 2^d=0 



(3+1)! ^d=0 (d+T) 



m— 1 



Since 1 < — — - — -r < — — - — rr = e — 1 we establish that 
~ ^ (d + l)\ ^ (d+l)l 

d=0 v ; d=0 v ; 

m m 
0.58m < < —m-l i — - m - 



Em-. 
d=0 



(d+1) 



approximate speedup 
from exploiting symmetry 
when n = p and m <^ip 

As the above shows, if we take advantage of symmetry when applying the sttsm 
operation to an order-m tensor, we can reduce the number of required flops by a factor 
between 0.58m and m over the approach examined which does not take advantage of 
symmetry. 

5.7. Permutation overhead. Tensor computations are generally implemented 
by a sequence of permutations of the data in the input tensor A., interleaved with 
calls to a matrix-matrix multiplication implemented by a call to the BLAS routine 
dgemm, details of which can be found in Appendix [A) These permutations constitute 
a nontrivial overhead, as we will discuss now. 

If one does NOT take advantage of symmetry, then a series of exactly m tensor- 
matrix-multiply operations are required to be performed on A, each resizing a single 
mode at a time. This means the cost associated with permutations for the dense case 
is 

m— 1 

+ 2— ^ p d n m ~ d memory operations (memops). 

77 d=0 

When BCSS is used, the permutations and multiplications happen for each tensor 
computation with a block. Figure |5.4| presents an analysis of the cost of permuting 
data for each tensor-matrix-multiply operation, as well as the total cost of performing 
permutations of the data. We see that the proposed algorithm-by-blocks requires 
significantly more memops to perform the permutations since each block of A is 
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times executed 


for j m _i =0,...,p-l 

rj-(m-l) ._ 






•A X TO _1 ( Xj m _ l0 ' ' ' Xj m _ 1 ( s _ 1 ) ) 


n m + 2b e n m - 1 


(0 


for Ji = 0, . . . ,j 2 

y :=?- (2) X! (X Jl0 ■■■ X Jl(B _ 1} ) 

1U1 JO U J • • • ! Jl 

e- - - — 

x (X Jo0 • • • X Jo(fi _ 1) ) 
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{ m-1 ) 

/p + ra - 1\ 
\ m J 


endfor 
endfor 






endfor 







Total Cost: 
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7 d=0 
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mem ops 



Fig. 5.4. Algorithm-by-blocks for computing C := [A; X, • • • , X]. T/ie algorithm assumes that 
6 is partitioned into blocks of size b 7 ^ , with p = [p/feel • 



(i+2S)Er 1 A 



permuted [ ^ ^ ) times. The required cost becomes 



(i + 2£)£SV- 



(i+2^)E;;o"e 

(if, further, n = p) 

This yields 

1.74m 

< 



(d+l)! 

3m 



\ L "t z n J 2^d=o (d+i)i 



mcmops. 



3m 



(p + 2) (p + 2)(e-l) 



< 



3m 



(p+2)Er 1 (^T) 



< 



3m 
(P + 2)" 



approximate speedup 
from exploiting symmetry 
when n = p and m«p 

As the above shows, if p is chosen too large, the algorithm will require more memops 
in the form of permutations, unless m is scaled accordingly. However, m is typically 
defined by the problem and therefore is not adjustable. 

We can mitigate the permutation overhead by adjusting be. For example, by 
letting p = 2 (be = \p/2~}), we are able to take advantage of symmetry in G for 
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Fig. 5.5. Comparison of dense to BCSS algorithms. Solid and dashed lines correspond to dense 
and BCSS, respectively. From left to right: storage requirements; cost from computation (flops); cost 
from permutations (memops). For these graphs 6/j = &g = 8. 
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Fig. 5.6. Comparison of dense to BCSS algorithms. Solid and dashed lines correspond to dense 
and BCSS, respectively. From left to right: storage requirements; cost from computation (flops); cost 
from permutations (memops). Here fi = p = 2. 



storage reasons (as be ^ 1) while limiting the increase in memops required (relative 
to that encounted when dense storage is used) to 

„ 3to 3m 

0.43m < — — < — = — < 0.75m. 

It is tempting to discuss an optimal choice for C. However, it would be better to focus 
on greatly reducing this overhead, as we will discuss in the conclusion. 

5.8. Summary. Figures pT5f|5.6| illustrate the insights discussed in this section. 
The (exact) formulae developed for storage, flops, and memops are used to compare 
and contrast dense storage with BCSS. 
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In Figure 5.5 the top graphs report storage, flops, and memops (due to permu- 
tations) as a function of tensor dimension (n), for different tensor orders (m), for 
the case where the storage block size is relatively small (64 = be = 8). The bottom 
graphs report the same information, but as a ratio. The graphs illustrate that BCSS 
dramatically reduces the required storage and the proposed algorithms reduce the 
flops requirements for the sttsm operation, at the expense of additional memops due 
to the encountered permutations. 

In Figure |5.6| a similar analysis is given, but for the case where the block size is 
half the tensor dimension (i.e., n — p = 2). It shows that the memops can be greatly 
reduced by increasing the storage block dimensions, but this then adversely affects 
the storage and computational benefits. 

It would be tempting to discuss how to choose an optimal block dimension. How- 
ever, the real issues is that the overhead of permutation should be reduced and/or 
eliminated. Once that is achieved, in future research, the question of how to choose 
the block size becomes relevant. 

6. Experimental Results. In this section, we report on the performance at- 
tained by an implementation of the discussed approach. It is important to keep in 
mind that the current state of the art of tensor computations is first and foremost 
concerned with reducing memory requirements so that reasonably large problems can 
be executed. This is where taking advantage of symmetry is important. Second to 
that is the desire to reduce the number of floating point operations to the minimum 
required. Our algorithms perform the minimum number of floating point operation 
(except for a lower order term that would result from taking advantage of partial 
symmetry in the temporary results). Actually attaining high performance, like is 
attained for dense matrix computations, is still relatively low priority compared to 
these issues. 

6.1. Target architecture. We report on experiments on a single core of a Dell 
PowerEdge R900 server consisting of four six-core Intel Xeon 7400 processors and 96 
GBytes of memory. Performance experiments were gathered under the GNU/Linux 
2.6.18 operating system. Source code was compiled by the GNU C compiler, ver- 
sion 4.1.2. All experiments were performed in double-precision floating-point arith- 
metic on randomized real domain matrices. The implementations were linked to the 
OpenBLAS 0.1.1 library [HHB], a fork of the GotoBLAS2 implementation of the 
BL AS [HI E] • As noted, most time is spent in the permutations necessary to cast 
computation in terms of the BLAS matrix-matrix multiplication routine dgemm. Thus, 
the peak performance of the processor and the details of the performance attained by 
the BLAS library are mostly irrelevant at this stage. The experiments merely show 
that the new approach to storing matrices as well as the algorithm that takes advan- 
tage of symmetry has promise, rather than making a statement about optimality of 
the implementation. Much room for improvement remains. 

6.2. Implementation. The implementation was coded in a style inspired by 
the libf lame library US]. An API similar to the FLASH API [T7] for storing 
matrices as matrices of blocks and implementing algorithm-by-blocks was defined and 
implemented. Computations with the (tensor and matrix) blocks were implemented 
as the discussed sequence of permutations interleaved with calls to the dgemm BLAS 
kernel. No attempt was yet made to optimize these permutations. However, an 
apples-to-apples comparison resulted from using the same sequence of permutations 
and calls to dgemm for both the experiments that take advantage of symmetry and 
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Fig. 6.1. Experimental results when n = p = 16, bji = &e = 8 and the tensor order, m, 
is varied. For m = 8, storing A. and 6 without taking advantage of symmetry requires too much 
memory. 




tensor dimension (n=p) tensor dimension (n=p) tensor dimension (n=p) 



Fig. 6.2. Experimental results when the order m = 5, 64 = fcg = 8 and the tensor dimensions 
n = p are varied. For n = p = 72, storing A and C without taking advantage of symmetry requires 
too much memory. 
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Fig. 6.3. Experimental results when the order m = 5, n = p = 64 and the block dimensions 
t>A = be cire varied. 



those that store and compute with the tensor densely, ignoring symmetry. 

6.3. Results. Figures |6.1f|6.3| show results from executing our implementation 
on the target architecture of the algorithms which do and do not take advantage of 
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Fig. 6.4. Comparison of computational cost to permutation cost where m = 5, n = p = 64 and 
tensor block dimension (bji, bg) is varied. The solid line represents the number of flops (due to 
computation) required for a given problem (left axis), and the dashed line represents the number of 
memops (due to permutation) required for a given problem (right axis). 



symmetry. The first is denoted a BCSS algorithm while the latter a dense algorithm. 
All figures show comparisons of the execution time of each algorithm, the associated 
speedup of the BCSS algorithm over the dense algorithm, and the estimated storage 
savings factor of the BCSS algorithm. 

For the experiments reported in Figure 6.1 we fix n, p, bji, be, and vary the tensor 
order to. The BCSS algorithm begins to outperform the dense algorithm after the 
tensor order is equal to or greater than 4. This is due to the fact that at lower tensor 
orders, the speedup of the BCSS algorithm is lower, and therefore any computational 
overhead introduced due to our implementation of the BCSS algorithm is more pro- 
nounced (increasing the execution time of the BCSS algorithm). Additionally, notice 
that that BCSS allows larger problems to be solved; the dense algorithm was unable 
to store the entire symmetric tensor when m = 8. We see that once the problem-size 
becomes larger (to = 7), the speedup of the BCSS algorithm over the dense algorithm 
decreases. This is surprising since, according to the mathematics defining this opera- 
tion, we would expect a continual increase in speedup (computational savings of the 
BCSS algorithm is proportional to the order of the tensor). We believe this effect to 
be due to an implementation detail that we plan to investigate in future work. 

In Figure |6T2| we fix to, bji, and be an d vary the tensor dimensions n and p. We see 
that the BCSS algorithm outperforms the dense algorithm and attains a noticeable 
speedup. The speedup appears to level out as the tensor dimension increases which 
is unexpected. We will discuss this effect later as it appears in other experiments. 

In Figure [6~3| we fix to, n, and p, and vary the block sizes bj± and be- The right- 
most point on the axis corresponds to the dense case (as bji = n = be = p) and 
the left-most point corresponds to the fully-compact case (where only unique entries 
are stored). There now is a range of block dimensions for which the BCSS algorithm 
outperforms the dense algorithm. Similar to previous result, as we increase the block 
size, our relative speedup decreases which we now discuss. 

In Figures |6.2|^6.3| we saw that as tensor dimension or block dimension increased 
the relative speedup of the BCSS algorithm over the dense algorithm lessened/leveled 
out. We believe this is due to a balance between the cost of permutations and the 
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cost of computation. In Figure 6.4 we illustrate (with predicted flop and memop 
counts) that if we pick a small block dimension, we dramatically increase the memops 
due to permutations while dramatically decreasing the computational cost. If instead 
we pick a large block dimension, we dramatically lower the permutation cost, while 
increasing the computational cost. Although this plot only shows one case, in general 
this tension between memops due to permutations and flops due to computation 
exists. In the large block dimension regime we are performing fewer permutations 
than computations which is beneficial since memops are significantly more expensive 
than flops. 

We believe that the effect of the small block dimension region (which dramatically 



increases memops) explains the dropping/leveling off of relative speedup of Figure 6.2 
where the tensor dimension becomes large while the block dimension remains constant 
and thus the relative block dimension becomes small. Further, we believe that the 
effect of the large block dimension region which dramatically increases flops explains 



the dropping/leveling off of relative speedup of Figure 6.3 where the block dimension 
increases while the tensor dimension remains constant (the block dimension becomes 
relatively large). This, and other effects related to block size, will be studied further 
in future research. 

7. Conclusion and Future Work. We have presented storage by blocks, BCSS, 
for tensors and shown how this can be used to compactly store symmetric tensors. 
The benefits were demonstrated with an implementation of a new algorithm for the 
change-of-basis (sttsm) operation. Theoretical and practical results showed that both 
the storage and computational requirements were reduced relative to storing the ten- 
sors densely and computing without taking advantage of symmetry. 

This initial study has exposed many new research opportunities, which we believe 
to be the real contribution of this paper. We finish by discussing some of these 
opportunities. 

Optimizing permutations. In our work, we made absolutely no attempt to opti- 
mize the permutation operations. Without doubt, a careful study of how to organize 
these permutations will greatly benefit performance. It is likely that the current im- 
plementation not only causes unnecessary cache misses, but also a great number of 
Translation Lookaside Buffer (TLB) misses [12], which cause the core to stall for a 
hundred or more cycles. 

Optimized kernels/avoiding permutation. A better way to mitigate the permuta- 
tions is to avoid them as much as possible. If n = p, the sttsm operation performs 
0(ri m+1 ) operations on 0{n m ) data. This exposes plenty of opportunity to optimize 
this kernel much like dgemm, which performs 0(n 3 ) computation on 0{n 2 ) data, is 
optimized. For other tensor operations, the ratio is even more favorable. 

We are developing a BLAS-like library, BLIS [35], that allows matrix operations 
with matrices that have both a row and a column stride, as opposed to the traditional 
column-major order supported by the BLAS. This means that computation with a 
planar slice in a tensor can be passed into the BLIS matrix-matrix multiplication 
routine, avoiding the explicit permutations that must now be performed before calling 
dgemm. How to rewrite the computations with blocks in terms of BLIS, and studying 
the performance benefits, is a future topic of research. 

One can envision instead creating a BLAS-like library for the tensor operations we 
perform with blocks. One alternative for this is to apply the techniques developed as 
part of the PHiPAC [7J, TCE, SPIRAL [21 , or ATLAS [27] projects to the problem 
of how to optimize computations with blocks. This should be a simpler problem 
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than optimizing the complete tensor contraction problems that, for example, TCE 
targets now, since the sizes of the operands are restricted. The alternative is to create 
microkernels for tensor computations, similar to the microkernels that BLIS defines 
and exploits for matrix computations, and to use these to build a high-performance 
tensor library that in turn can then be used for the computations with tensor blocks. 

Algorithmic variants for the sttsm operation. For matrices, there is a second 
algorithmic variant for computing C := XAX T . Partition A by rows and X by 
columns: 

x„_i ) . 



Then 

C = XAX T =(x ••• x„_! ) : J X T =x (a^X)+..-x n _ 1 (a£_ 1 X). 

V S n-1 / 

We suspect that this insight can be extended to the sttsm operation, yielding a 
new set of algorithm-by-blocks that will have different storage and computational 
characteristics. 

Extending the FLAME methodology to multi-linear operations. In this paper, we 
took an algorithm that was systematically derived with the FLAME methodology for 
the matrix case and then extended it to the equivalent tensor computation. Ideally, 
we would derive algorithms directly from the specification of the tensor computation, 
using a similar methodology. This requires a careful consideration of how to extend 
the FLAME notation for expressing matrix algorithms, as well as how to then use 
that notation to systematically derive algorithms. 

Partial symmetry: storage and computation. The presented algorithm, which 
takes advantage of symmetry for computational purposes only, does so with respect 
to the final output C However, the described algorithm forms temporaries which 
contain partial-symmetries of which the proposed algorithm does not take advantage. 
Therefore, unnecessary storage is required and unnecessary computation is being per- 
formed when forming temporaries. Moreover, as numerous temporaries are formed, 
one can argue that taking advantage of partial symmetries in temporaries is almost 
as important as taking advantage of symmetry in the input and output tensors. 

So far, we have only spoken about exploiting partial symmetries at the block 
level. However, each block may contain partial-symmetries which can be exploited 
for computational or storage reasons. It is difficult to say whether taking advantage 
of partial-symmetries within each block is advantageous, but it should be examined. 

Multithreaded parallel implementation. Multithreaded parallelism can be accom- 
plished in a number of ways. 

• The code can be linked to a multithreaded implementation of the BLAS, 
thus attaining parallelism within the dgemm call. This would require one to 
hand-parallelize the permutations. 

• Parallelism can be achieved by scheduling the operations with blocks to 
threads much like our SupcrMatrix [22 runtime does for the libf lame li- 
brary, or PLASMA [1[ does for its tiled algorithms. 
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and X = ( xq 



We did not yet pursue this because at the moment the permutations are a bottleneck 
that, likely, consumes all available memory bandwidth. As a result, parallelization 
does not make sense until the cost of the permutations is mitigated. 

Exploiting accelerators. In a number of papers [181 114] , we and others have shown 
how the algorithm-by-blocks (tiled algorithm) approach, when combined with a run- 
time system, can exploit (multiple) GPUs and other accelerators. These techniques 
can be naturally extended to accommodate the algorithm-by-blocks for tensor com- 
putations. 

Distributed parallel implementation. Once we understand how to derive sequential 
algorithms, it becomes possible to consider distributed memory parallel implementa- 
tion. It may be that our insights can be incorporated into the Cyclops Tensor Frame- 
work 23 , or that we build on our own experience with distributed memory libraries 
for dense matrix computations, the PLAPACK [24] and Elemental [20] libraries, to 
develop a new distributed memory tensor library. 

General multi-linear library. The ultimate question is, of course, how the insights 
in this paper and future ones can be extended to a general, high-performance multi- 
linear library, for all platforms. 
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Appendix A. Casting Tensor-Matrix Multiplication to BLAS. Given a 
tensor A £ ]R /oX '" x/ ™-i , a mode k, and a matrix B £ M ,/x/fc , the result of mul- 
tiplying B along the fc-th mode of A is denoted by C = A Xk B, where 6 £ 
R /„x...x/ t _ 1 xjx/ H1 x...x/ m _ 1 and each element of e is defined as 



This operation is typically computed by casting it as a matrix-matrix multiplication 
for which high-performance implementations are available as part of the Basic Linear 
Algebra Subprograms (BLAS) routine dgemm. 

The problem viewing a higher-order tensor as a matrix is analogous to the problem 
of viewing a matrix as a vector. We first describe this simpler problem and show how 
it generalizes to objects of higher-dimension. 

Matrices as vectors (and vice-versa). A matrix A £ jj mxn can De viewed as a 
vector a £ R M where M — mn by assigning aj +j lTO = A^. (This is analogous 
to column-major order assignment of a matrix to memory.) This alternative view 
does not change the relative order of the elements in the matrix, since it just logically 
views them in a different way. We say that the two dimensions of A are merged or 
"grouped" to form the single index of a. 

Using the same approach, we can view a as A by assigning the elements of A 
according to the mentioned equivalence. In this case, we are in effect viewing the 
single index of a as two separate indices. We refer to this effect as a "splitting" of the 
index of a. 

Tensors as matrices (and vice-versa). A straightforward extension of grouping of 
indices allows us to view higher-order tensors as matrices and (inversely) matrices as 
higher-order tensors. The difference lies with the calculation used to assign elements 
of the lower/higher-order tensor. 

As an example, consider an order-4 tensor G £ r 7 ox/ix/ 2 x; 3 ^ ^ e can v j ew g as 
a matrix C £ R j ° xJi where Jo = Io x h and J\ = I2 x I3. Because of this particular 
grouping of indices, the elements as laid out in memory need not be rearranged (rela- 
tive order of each element remains the same). This follows from the observation that 
memory itself is a linear array (vector) and realizing that if C and G are both mapped 
to a 1-dimensional vector using column-major order and its higher dimensional ex- 
tension (which we will call dimensional order), both will be stored identically. 

The need for permutation. If we wished to instead view our example 
G £ R'ox/ix^xia ag a matrix C e K ./oxJ! where; for msta nce, J = h and J x = 

Io x I2 x I3, then this would require a rearrangement of the data since mapping C and 
6 to memory using dimensional order will not generally produce the same result for 
both. This is a consequence of changing the relative order of indices in our mappings. 

This rearrangement of data is what is referred to as a permutation of data. By 
specifying an input tensor A £ g-fox-x/m-i an( j ^ ne dggjj-gd permutation of in- 
dices of A, ir, we define the transformation G = permute(Jl, n) that yields C £ 
R 7 "o x/ "i x -" x7 "m-i go that Gi' ...i' = Ai ...i m _ 1 where i' corresponds to the result of 
applying the permutation tt to i. The related operation ipermute inverts this transfo- 

/ _1 X/ _i X---X/ -1 

mation when supplied tt so that G = ipermute(A, w) yields G £ R "° ' 1 " m - 1 
where = Ai„...i m _ 1 where i' corresponds to the result of applying the per- 

mutation 7r _1 to i. 
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Fig. B.l. Le/t: Storage requirements of A £ Rl m > 64 1 as block dimension changes. Right: Storage 
requirements of A £ IK' 5,64 ] for different choices for the meta-data stored per block, measured by k, 
as block dimension changes. 



Casting a tensor computation in terms of a matrix-matrix multiplication. We can 
now show how the operation G = A x k B, where A G R^x-xJ™-^ B g ]R Jx/ '= i and 
6 6 jj-fox • x/ fe _ix Jx7 fc+1 x - x/ m _^ can k e cagt ag a ma t r ix-matrix multiplication if the 
tensors are appropriately permuted. The following describes the algorithm: 

1. Permute: Tji permutc(^l, {k, 0, . . . , k — 1, k + 1, . . . , m — 1}). 

2. Permute: :P e <- permute(C, {k, 0, . . . , k — 1, k + 1, . . . , to - 1}). 

3. View tensor J"^ as matrix A: A D 5 ^, where A e T R /fcX ' 71 and Ji = 

4. View tensor J"e as matrix C: C <- J"e, where C e M' 7xJl and Ji = 

-?0 " " " Ik-lh+l ■ ■ ■ Im-1- 

5. Compute matrix-matrix product: C := BA. 

6. View matrix C as tensor Te- J'e <— C, 
where 3> e € R-' ! <^x-x; t -ix/ t+ ix-x/ m _ 1> 

7. "Unpermute" : 6 ipermute(J > e, {k, 0, . . . , k — 1, fc + 1, . . . , m — 1}). 

Step 5. can be implemented by a call to the BLAS routine dgemm, which is typically 
highly optimized. 

Appendix B. Design Details. 

We now give a few details about the particular implementation of BCSS, and how 
this impacts storage requirements. Notice that this is one choice for implementing 
this storage scheme in practice. One can envision other options that, at the expense 
of added complexity in the code, reduce the memory footprint. 

BCSS views tensors hierarchically. At the top level, there is a tensor where 
each element of that tensor is itself a tensor (block). Our way of implementing this 
stores a description (meta-data) for a block in each element of the top-level tensor. 
This meta-data adds to memory requirements. In our current implementation, the 
top-level tensor of meta-data is itself a dense tensor. The meta-data in the upper 
hypertriangular tensor describes stored blocks. The meta-data in the rest of the top- 
level tensor reference the blocks that correspond to those in the upper hypertriangular 

24 



tensor (thus requiring an awareness of the permutation needed to take a stored block 
and transform it). This design choice greatly simplifies our implementation (which 
we hope to describe in a future paper). We will show that although the meta-data 
can potentially require considerable space, this can be easily mitigated. We will use 
A for example purposes. 

Given A G M}" 1 ' 71 ^ stored with BCSS with block dimension bji, we must store 
meta-data for n m blocks where n — \n/bj\\ . This means that the total cost of storing 
A with BCSS is 

Cstorage(^) = kfi m + b% + ™ ~ ^ floats, 

A; is a constant representing the amount of storage required for the meta-data asso- 
ciated with one block, in floats. Obviously, this meta-data is of a different datatype, 
but floats will be our unit. 

There is a tradeoff between the cost for storing the meta-data and the actually 
entries of A, parameterized by the blocksize bj±: 

• If bji = n, then we only require a minimal amount of memory for meta-data, 
k floats, but must store all entries of A since there now is only one block, and 
that block uses dense storage. We thus store slightly more than we would if 
we stored the tensor without symmetry. 

• If bji = 1, then n = n and we must store meta-data for each element, meaning 
we store much more data than if we just used a dense storage scheme. 

Picking a block dimension somewhere between these two extremes results in a smaller 
footprint overall. For example, if we choose a block dimension bji = yn, then n = y/n 
and the total storage required to store A with BCSS is 

„ , _ m im (n + m-l\ , m rnfn^+m — l 

Cstorage(^) = kn m + 6^ = kn 2 + 71 2 

V m J \ m 

kn 1 ^ + bt I — _ I = n? [ k H I floats, 



ml J \ m! 



which, provided that k <C — — , is significantly smaller than the storage required for 

the dense case (n m ). This discussion suggests that a point exists that requires less 
storage than the dense case (showing that BCSS is a feasible solution). 

In Figure |B.1[ we illustrate that as long as we pick a block dimension that is 
greater than 4, we avoid incurring extra costs due to meta-data storage. It should be 
noted that changing the dimension of the tensors also has no effect on the minimum, 
however if they are too small, then the dense storage scheme may be the minimal 
storage scheme. Additionally, adjusting the order of tensors has no real effect on the 
block dimension associated with minimal storage. However increasing the amount 
of storage allotted for meta-data slowly shifts the optimal block dimension choice 
towards the dense storage case. 
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